Exhibit A 



Deciphering the Message in Protein Sequences: 
Tolerance to Amino Acid Substitutions 

James U. Bowie, John F. Reidhaar-Olson, Wendell A. Lim, 
Robert T. Sauer 



An amino acid sequence encodes a message that determines 
the shape and function of a protein. This message is 
highly degenerate in that many different sequences can 
code for proteins with essentially the same structure and 
activity. Comparison of different sequences with similar 
messages can reveal key features of the code and improve 
understanding of how a protein folds and how it performs 
its function. 



The genome is manifest largely in the set of proteins 
that it encodes. It is the ability of these proteins to fold 
into unique three-dimensional structures that allows them to 
function and carry out the instructions of the genome. Thus, 
comprehending the rules that relate amino acid sequence to 
structure is fundamental to an understanding of biological processes. 
Because an amino acid sequence contains all of the information 
necessary to determine the structure of a protein (1), it should be 
possible to predict structure from sequence, and subsequently to 
infer detailed aspects of function from the structure. However, both 
problems are extremely complex, and it seems unlikely that either 
will be solved in an exact manner in the near future. It may be 
possible to obtain approximate solutions by using experimental data 
to simplify the problem. In this article, we describe how an analysis 
of allowed amino acid substitutions in proteins can be used to 
reduce the complexity of sequences and reveal important aspects of 
structure and function. 



Methods for Studying Tolerance to 
Sequence Variation 

There are two main approaches to studying the tolerance of an 
amino acid sequence to change. The first method relies on the 
process of evolution, in which mutations are either accepted or 
rejected by natural selection. This method has been extremely 
powerful for proteins such as the globins or cytochromes, for which 
sequences from many different species are known (2-7). The second 
approach uses genetic methods to introduce amino acid changes at 
specific positions in a cloned gene and uses selections or screens to 
identify functional sequences. This approach has been used to great 
advantage for proteins that can be expressed in bacteria or yeast, 
where the appropriate genetic manipulations are possible (3, 8-11). 
The end results of both methods are lists of active sequences that can 
be compared and analyzed to identify sequence features that are 
essential for folding or function. If a particular property of a side 
chain, such as charge or size, is important at a given position, only 
side chains that have the required property will be allowed. Con- 
versely, if the chemical identity of the side chain is unimportant, 
then many different substitutions will be permitted. 
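
The comparison of active sequences described above can be sketched in a few lines of Python. The example is illustrative only: the function names, the toy sequences, and the grouping of residues into four property classes are assumptions made for this sketch and are not taken from the studies cited. It tabulates the residues observed at each aligned position and labels each position as invariant, conservative (all observed residues share one property class), or tolerant.

```python
# Illustrative property classes (an assumption of this sketch, not from the text).
PROPERTY_CLASSES = {
    "hydrophobic": "AVLIMFWC",
    "polar": "STNQYGPH",
    "positive": "KR",
    "negative": "DE",
}
CLASS_OF = {res: name for name, group in PROPERTY_CLASSES.items() for res in group}


def allowed_residues(active_sequences):
    """Return the set of residues observed at each aligned position."""
    length = len(active_sequences[0])
    if any(len(seq) != length for seq in active_sequences):
        raise ValueError("sequences must be aligned to the same length")
    return [set(column) for column in zip(*active_sequences)]


def classify_position(residues):
    """Label one position by how much chemical variation it tolerates."""
    if len(residues) == 1:
        return "invariant"
    classes = {CLASS_OF[res] for res in residues}
    return "conservative" if len(classes) == 1 else "tolerant"


# Toy list of active sequences (hypothetical data).
active = ["MKVLD", "MRVIE", "MKVLS"]
for index, residues in enumerate(allowed_residues(active), start=1):
    print(index, "".join(sorted(residues)), classify_position(residues))
```

With real data, the list of active sequences would come from a functional selection or from a phylogenetic alignment, and the property classes would be chosen to match the question being asked (charge, size, hydrophobicity).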

Studies in which these methods were used have revealed that 
proteins are surprisingly tolerant of amino acid substitutions (2-4, 
11). For example, in studying the effects of approximately 1500 
single amino acid substitutions at 142 positions in lac repressor, 
Miller and co-workers found that about one-half of all substitutions 
were phenotypically silent (11). At some positions, many different, 
nonconservative substitutions were allowed. Such residue positions 
play little or no role in structure and function. At other positions, no 
substitutions or only conservative substitutions were allowed. These 
residues are the most important for lac repressor activity. 

What roles do invariant and conserved side chains play in 
proteins? Residues that are directly involved in protein functions 
such as binding or catalysis will certainly be among the most 
conserved. For example, replacing the Asp in the catalytic triad of 
trypsin with Asn results in a 10^4-fold reduction in activity (12). A 
similar loss of activity occurs in λ repressor when a DNA binding 
residue is changed from Asn to Asp (13). To carry out their 
function, however, these catalytic residues and binding residues 
must be precisely oriented in three dimensions. Consequently, 
mutations in residues that are required for structure formation or 
stability can also have dramatic effects on activity (10, 14-16). 
Hence, many of the residues that are conserved in sets of related 
sequences play structural roles. 



Substitutions at Surface and Buried Positions 

In their initial comparisons of the globin sequences, Perutz and 
co-workers found that most buried residues require nonpolar side 
chains, whereas few features of surface side chains are generally 
conserved (6). Similar results have been seen for a number of protein 
families (2, 4, 5, 7, 17, 18). An example of the sequence tolerance at 
surface versus buried sites can be seen in Fig. 1, which shows the 
allowed substitutions in λ repressor at residue positions that are near 
the dimer interface but distant from the DNA binding surface of the 
protein (9). These substitutions were identified by a functional 
[...] exposed positions tolerate a wide range of chemically different side 
chains, including hydrophilic and hydrophobic residues. Hence, it 
seems that most of the structural information in this region of the 
protein is carried by the residues that are solvent inaccessible. 

Fig. 1. (A) Amino acid substitutions allowed in a short region of 
λ repressor. The wild-type sequence is shown along the center line. 
The allowed substitutions shown above each position were identified 
by randomly mutating one to [...] at a time by using a cassette method 
and applying a functional selection (9). (B) The fractional solvent 
accessibility (42) of the wild-type side chain in the protein dimer (43) 
relative to the same atoms in an Ala-X-Ala model tripeptide. 

Constraints on Core Sequences 

Because core residue positions appear to be extremely important 
for protein folding or stability, we must understand the factors that 
determine whether a given core sequence will be acceptable. In general, 
only hydrophobic or neutral residues are tolerated at buried sites 
[...] the hydrophobic effect to protein stability (19). For example, 
Fig. 2 [...] the N-terminal domain of λ repressor (20). The acceptable 
core [...] hydrogen bonds and salt bridges [...] 
donors and acceptors in an exact geometry. Thus the repertoire of 
possible structures that use a polar core would probably be [...] 
where their hydrogen [...] 

[...] individual sites, however, can be considerably larger. For example, 
[...] position in the appropriate sequence contexts. Large volume 
changes at individual buried sites have also been observed in 
phylogenetic studies, where it has been noted that the size decreases 
and increases at interacting residues are not necessarily related in a 
simple complementary fashion (5, 7, 17). Rather, local volume 
changes are accommodated by conformational changes in nearby 
side chains and by a variety of backbone movements. 

The Informational Importance of the Core 

With occasional exceptions, the core must remain hydrophobic 
[...] composed of side chains that can assume only a limited number of 
[...] packing must be maintained without 
steric clashes. How important are hydrophobicity, volume, and 
steric complementarity in determining whether a given sequence can 
form an acceptable core? Each factor is essential in a physical sense, 
as a stable core is probably unable to tolerate unsatisfied hydrogen 
bonding groups, large holes, or steric overlaps (25). However, in an 
informational sense, these factors are not equivalent. For example, in 
experiments in which three core residues [...] combinations 
of the 20 naturally occurring amino acids had volumes within 
the range tolerated in the core, and yet most of these were 
unacceptable (20). In contrast, of the sequences that contained only 
the appropriate hydrophobic residues, a significant fraction were 
acceptable. [...] the hydrophobicity of a sequence contains 
more information about its potential acceptability in the core than 
does the total side chain volume. Steric compatibility was intermediate 
between volume and hydrophobicity in informational importance. 

Fig. 2. Amino acid substitutions allowed in the core of λ repressor. 
The wild-type side chains are shown pictorially in the approximate 
orientation seen in the crystal structure (43). The lists of allowed 
substitutions at each position are shown below the wild-type side 
chains. These substitutions were identified by randomly mutating one 
to four residues at a time by using a cassette method and applying 
a functional selection (20). Not all substitutions are allowed in every 
sequence background. 


The Informational Importance of Surface Sites 

We have noted that many surface sites can tolerate a wide variety 
of side chains, including hydrophilic and hydrophobic residues. This 
result might be taken to indicate that surface positions contain little 
structural information. However, Bashford et al., in an extensive 
analysis of globin sequences (4), found a strong bias against large 
hydrophobic residues at many surface positions. At one level, this 
[...] imposed by protein solubility, because large 
patches of hydrophobic surface residues would presumably lead to 
aggregation. At a more fundamental level, protein folding requires a 
partitioning between surface and buried positions. Consequently, to 
achieve a unique native state without significant competition from 
[...] a decided preference for exterior rather than interior positions. 
As a result, many surface sites can accept hydrophobic residues 
individually, but the surface as a whole can probably tolerate only a 
moderate number of hydrophobic side chains. 

Identification of Residue Roles from 
Sets of Sequences 

Often, a protein of interest is a member of a family of related 
sequences. What can we infer from the pattern of allowed substitutions 
at positions in sets of aligned sequences generated by genetic 
or phylogenetic methods? Residue positions that can accept a 
number of different side chains, including charged and highly polar 
residues, are almost certain to be on the protein surface. Residue 
positions that remain hydrophobic, whether variable or not, are 
likely to be buried within the structure. In Fig. 3, those residue 
positions in λ repressor that can accept hydrophilic side chains are 
shown in orange and those that cannot accept hydrophilic side 
chains are shown in green. The obligate hydrophobic positions 
define the core of the structure, whereas positions that can accept 
hydrophilic side chains define the surface. 

Functionally important residues should be conserved in sets of 
active sequences, but it is not possible to decide whether a side chain 
[...] conserved. To make this distinction requires an independent assay of 
protein folding. The ability of a mutant protein to maintain a stably 
folded structure can often be measured by biophysical techniques 
[...] structure (27, 28). In the latter 
cases, it is possible to screen proteins in mutated clones for the 
ability to fold even if these proteins are inactive. Sets of sequences 
that allow formation of a stable structure can then be compared [...] 
residues being those that are variable in the set of stable 
proteins but invariant in the set of functional proteins. The DNA 
[...] residues of human growth hormone were also 
identified by comparing the stabilities and activities of a set of 
[...] hormone and related 
hormones with different binding specificities. 

Fig. 3. [...] Positions that can tolerate hydrophilic side chains [...] 
shown in orange. The same side chains are shown in (B) without the 
[...] protein atoms. [...] shown in green. These side chains are 
[...] (D). About three-fourths of the 92 side chains in the N-terminal 
domain are included in both (B) and (D). The remaining positions 
have not been tested. Data are from (9, 14, 20, 27, 44). 

Implications for Structure Prediction 

At present, the only reliable method for predicting a low-resolution 
tertiary structure of a new protein is by identifying 
[...] (29, 30). However, it is often difficult to align sequences as the level 
of sequence similarity decreases, and it is sometimes impossible to 
[...] of known structures, it would be advantageous 
to increase the reach of the available structural information by 
improving methods for detecting distant sequence relations and for 
subsequently aligning these sequences based on structural principles. 
In a normal homology search, the sequence database is scanned with 
a single test sequence, and every residue must be weighted equally. 
[...] residues are more important than others and should 
be weighted accordingly. Moreover, certain regions of the protein 
are more likely to contain gaps than others. Both kinds of information 
can be obtained from sequence sets, and several techniques have 
been used to combine such information into more appropriately 
weighted sequence searches and alignments (31). These methods 
were used to align the sequences of retroviral proteases with aspartic 
proteases, which in turn allowed construction of a three-dimension- 
al model for the protease of human immunodeficiency virus type 1 
(29). Comparison with the recently determined crystal structure of 
this protein revealed reasonable agreement in many areas of the 
predicted structure (32). 
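
The idea of weighting residues by their importance can be made concrete with a small sketch. This is not the method of (31); it is a deliberately simple stand-in in which the weighting rule, the function names, and the toy family are all assumptions. Each aligned position receives a weight of 1 when only one residue is observed, decreasing linearly as more residue types are tolerated, and a query is rewarded or penalized by that weight depending on whether its residue is in the allowed set. Gaps are ignored.

```python
def position_weights(family):
    """Return per-position allowed residue sets and illustrative weights.

    The linear weighting is an assumption of this sketch: 1.0 for an
    invariant column, approaching 0.0 as all 20 residue types appear.
    """
    allowed = [set(column) for column in zip(*family)]
    weights = [1.0 - (len(residues) - 1) / 19.0 for residues in allowed]
    return allowed, weights


def weighted_score(query, allowed, weights):
    """Score a query: +weight for an allowed residue, -weight otherwise."""
    if len(query) != len(allowed):
        raise ValueError("query must have the same length as the alignment")
    return sum(
        weight if residue in residues else -weight
        for residue, residues, weight in zip(query, allowed, weights)
    )


# Toy aligned family (hypothetical data).
family = ["MKVLD", "MRVIE", "MKVLS"]
allowed, weights = position_weights(family)
for query in ("MKVLD", "MKVLA", "AKVLD"):
    print(query, round(weighted_score(query, allowed, weights), 3))
```

In this toy case a mismatch at the invariant first position lowers the score more than a mismatch at the variable last position, which is the behavior the text argues for.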

The structural information at most surface sites is highly degener- 
ate. Except for functionally important residues, exterior positions 
seem to be important chiefly in maintaining a reasonably polar 
surface. The information contained in buried residues is also 
degenerate, the main requirement being that these residues remain 
hydrophobic. Thus, at its most basic level, the key structural 
message in an amino acid sequence may reside in its specific pattern 
of hydrophobic and hydrophilic residues. This is meant in an 
informational sense. Clearly, the precise structure and stability of a 
protein depend on a large number of detailed interactions. It is 
possible, however, that structural prediction at a more primitive 
level can be accomplished by concentrating on the most basic 
informational aspects of an amino acid sequence. For example, 
amphipathic patterns can be extracted from aligned sets of sequences 
and used, in some cases, to identify secondary structures. 
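
As a rough illustration of extracting such a pattern, the sketch below reduces an aligned set of sequences to a string of H and P symbols. A position is marked H only if every residue observed there is strongly hydrophobic, and P otherwise. The choice of strongly hydrophobic residues follows the H1 group listed in the Fig. 4 legend; the function name and the toy sequences are assumptions.

```python
# H1 group from the Fig. 4 legend: Trp, Ile, Phe, Leu, Met, Val, Cys.
STRONGLY_HYDROPHOBIC = set("WIFLMVC")


def hp_pattern(aligned_sequences):
    """Mark a column H if all observed residues are strongly hydrophobic."""
    return "".join(
        "H" if set(column) <= STRONGLY_HYDROPHOBIC else "P"
        for column in zip(*aligned_sequences)
    )


# Toy aligned family (hypothetical data).
family = ["LKMLEEV", "IRAVDQL", "LEGLKKI"]
print(hp_pattern(family[:1]))  # pattern from the first sequence alone
print(hp_pattern(family))      # pattern from the whole set
```

For these toy sequences the single sequence gives HPHHPPH, whereas the set gives HPPHPPH: the Met at the third position of the first sequence is not retained in the other members, so the position is not counted as obligately hydrophobic.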

If a region of secondary structure is packed against the hydropho- 
bic core, a pattern of hydrophobic residues reflecting the periodicity 
of the secondary structure is expected (33, 34). These patterns can be 
obscured in individual sequences by hydrophobic residues on the 
protein surface. It is rare, however, for a surface position to remain 
hydrophobic over the course of evolution. Consequently, the 
amphipathic patterns expected for simple secondary structures can be 
much clearer in a set of related sequences (6). This principle is 
illustrated in Fig. 4, which shows helical hydrophobic moment plots 
for the Antennapedia homeodomain sequence (Fig. 4A) and for a 
composite sequence derived from a set of homologous homeodo- 
main proteins (Fig. 4B) (35). The hydrophobic moment is a simple 
measure of the degree of amphipathic character of a sequence in a 
given secondary structure (34). The amphipathic character of the 
three α-helical regions in the Antennapedia protein (36) is clearly 
revealed only by the analysis of the combined set of homeodomain 
sequences. The secondary structure of Arc repressor, a small DNA- 
binding protein, was recently predicted by a similar method (8) and 
confirmed by nuclear magnetic resonance studies (37). 
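
A windowed helical hydrophobic moment of the kind plotted in Fig. 4 can be sketched as follows. The eight-residue window, the 100° angle between successive residues, and the three-level scoring come from the Fig. 4 legend. Several details are assumptions of this sketch: the moment is taken as the length of the vector sum of the per-residue values, the values +1, 0, and -1 are assigned to the high, medium, and low hydrophobicity groups in that order, and the low group is assumed to be Gln, Asn, Glu, Asp, Lys, and Arg, because that part of the legend is only partly legible.

```python
import math

HIGH = set("WIFLMVC")        # H1 group as listed in the Fig. 4 legend
LOW_ASSUMED = set("QNEDKR")  # H3 group; membership partly assumed


def residue_value(residue):
    """Map a residue to +1 (high), -1 (low), or 0 (everything else)."""
    if residue in HIGH:
        return 1
    if residue in LOW_ASSUMED:
        return -1
    return 0


def hydrophobic_moment(values, angle_deg=100.0):
    """Length of the vector sum of values placed every angle_deg degrees."""
    step = math.radians(angle_deg)
    x = sum(v * math.cos(i * step) for i, v in enumerate(values))
    y = sum(v * math.sin(i * step) for i, v in enumerate(values))
    return math.hypot(x, y)


def moment_profile(sequence, window=8, angle_deg=100.0):
    """Hydrophobic moment for each full window along the sequence."""
    values = [residue_value(r) for r in sequence]
    return [
        hydrophobic_moment(values[i:i + window], angle_deg)
        for i in range(len(values) - window + 1)
    ]


# Toy amphipathic sequence (hypothetical data).
print([round(m, 2) for m in moment_profile("LEELLEELLEEL")])
```

To apply the same calculation to an aligned family, each position would first be reduced to a single representative residue, as described in the Fig. 4 legend, before calling `moment_profile`.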

The specific pattern of hydrophobic and hydrophilic residues in 
an amino acid sequence must limit the number of different structures 
a given sequence can adopt and may indeed define its overall fold. If 
this is true, then the arrangement of hydrophobic and hydrophilic 
residues should be a characteristic feature of a particular fold. Sweet 
and Eisenberg have shown that the correlation of the pattern of 
hydrophobicity between two protein sequences is a good criterion 
for their structural relatedness (38). In addition, several studies 
indicate that patterns of obligatory hydrophobic positions identified 
from aligned sequences are distinctive features of sequences that 
adopt the same structure (4, 29, 38, 39). Thus, the order of 
hydrophobic and hydrophilic residues in a sequence may actually be 
sufficient information to determine the basic folding pattern of a 
protein sequence. 

Although the pattern of sequence hydrophobicity may be a 
characteristic feature of a particular fold, it is not yet clear how such 
patterns could be used for prediction of structure de novo. It is 
important to understand how patterns in sequence space can be 
related to structures in conformation space. Lau and Dill have 
approached this problem by studying the properties of simple 
sequences composed only of H (hydrophobic) and P (polar) groups 
on two-dimensional lattices (40). An example of such a representa- 



tion is shown in Fig. 5. Residues adjacent in the sequence must 
occupy adjacent squares on the lattice, and two residues cannot 
occupy the same space. Free energies of particular conformations are 
evaluated with a single term, an attraction of H groups. By 
considering chains of ten residues, an exhaustive conformational 
search for all 1024 possible sequences of H and P residues was 
possible. For longer sequences only a representative fraction of the 
allowed sequence or conformation space could be explored. The 
significant results were as follows: (i) not all sequences can fold into 
a "native" structure and only a few sequences form a unique native 
structure; (ii) the probability that a sequence will adopt a unique 
native structure increases with chain length; and (iii) the native 
states are compact, contain a hydrophobic core surrounded by polar 
residues, and contain significant secondary structure. Although the 
gap between these two-dimensional simulations and three-dimen- 
sional structures is large, the use of simple rules and sequence 
representations yields results similar to those expected for real 
proteins. Three-dimensional lattice methods are also beginning to 
be developed and evaluated (41). 
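
A lattice search of this kind is small enough to sketch directly. The code below is a minimal illustration and not the implementation of (40). It assumes an energy of -1 for every pair of H residues that are lattice neighbors but not adjacent in the chain, treats conformations related by rotation or reflection as identical, and calls a sequence a unique folder when exactly one conformation has the lowest (negative) energy. All function names are hypothetical.

```python
from itertools import product

MOVES = [(1, 0), (0, 1), (-1, 0), (0, -1)]


def conformations(n):
    """Yield self-avoiding walks of n residues, one per symmetry class."""
    def extend(path, turned):
        if len(path) == n:
            yield tuple(path)
            return
        x, y = path[-1]
        for dx, dy in MOVES:
            if not turned and dy < 0:
                continue  # first off-axis step must go up (drops mirror images)
            nxt = (x + dx, y + dy)
            if nxt in path:
                continue  # two residues cannot occupy the same square
            path.append(nxt)
            yield from extend(path, turned or dy != 0)
            path.pop()
    # Fixing the first bond along +x drops rotated copies (requires n >= 2).
    yield from extend([(0, 0), (1, 0)], False)


def energy(sequence, conformation):
    """Assumed energy: -1 per non-bonded H-H contact on the lattice."""
    index_at = {pos: i for i, pos in enumerate(conformation)}
    total = 0
    for i, (x, y) in enumerate(conformation):
        if sequence[i] != "H":
            continue
        for dx, dy in ((1, 0), (0, 1)):  # look right and up: each pair once
            j = index_at.get((x + dx, y + dy))
            if j is not None and abs(i - j) > 1 and sequence[j] == "H":
                total -= 1
    return total


def unique_folders(n):
    """Sequences of length n with exactly one lowest-energy conformation."""
    confs = list(conformations(n))
    found = []
    for sequence in map("".join, product("HP", repeat=n)):
        energies = [energy(sequence, c) for c in confs]
        best = min(energies)
        if best < 0 and energies.count(best) == 1:
            found.append(sequence)
    return found


print(len(list(conformations(6))), "conformations for 6 residues")
print(len(unique_folders(6)), "of", 2 ** 6, "sequences have a unique ground state")
```

The same functions work for chains of ten residues, as in the study described above, but the exhaustive search over all 1024 sequences is slow in pure Python, so the example uses a shorter chain.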



Summary 

There is more information in a set of related sequences than in a 
single sequence. A number of practical applications arise from an 
analysis of the tolerance of residue positions to change. First, such 
information permits the evaluation of a residue's importance to the 
function and stability of a protein. This ability to identify the 
essential elements of a protein sequence may improve our 
understanding of the determinants of protein folding and stability as well 
as protein function. Second, patterns of tolerance to amino acid 
substitutions of varying hydrophilicity can help to identify residues 
likely to be buried in a protein structure and those likely to occupy 
surface positions. The amphipathic patterns that emerge can be used 
to identify probable regions of secondary structure. Third, 
incorporating a knowledge of allowed substitutions can improve the ability 
to detect and align distantly related proteins because the essential 
residues can be given prominence in the alignment scoring. 

Fig. 4. Helical hydrophobic moments calculated by using (A) the 
Antennapedia homeodomain sequence or (B) a set of 39 aligned 
homeodomain sequences (35). The bars indicate the extent of the 
helical regions identified in nuclear magnetic resonance studies of the 
Antennapedia homeodomain (36). To determine hydrophobic moments, 
residues were assigned to one of three groups: H1 (high 
hydrophobicity = Trp, Ile, Phe, Leu, Met, Val, or Cys); H2 (medium 
hydrophobicity = [...] His, Gly, or Ser); and H3 (low 
hydrophobicity = [...] Lys, or Arg). For the aligned homeodomain 
sequences, the residues at each position were sorted by 
hydrophobicity by using the scale of Fauchère and Pliska (45). Arg and 
Lys were not counted unless no other residue was found at the 
position, because they contain long aliphatic side chains and can 
thereby substitute for nonpolar residues at some buried sites. To 
account for possible sequence errors and rare exceptions, the most 
hydrophilic residue allowed at each position was discarded unless it 
was observed twice. The second most hydrophilic residue was then 
chosen to represent the hydrophobicity of each position. An 
eight-residue window was used and the vectors projected radially 
every 100°. The vector magnitudes were assigned a value of 1, 0, 
or -1 for position [...] 

Fig. 5. A representation of one compact conformation for a particular 
sequence of H and P residues on a two-dimensional square lattice. 
[Adapted from (40), with permission of the American Chemical Society] 

As more sequences are determined, it becomes increasingly likely 
that a protein of interest is a member of a family of related 
sequences. If this is not the case, it is now possible to use genetic 
methods to generate lists of allowed amino acid substitutions. 
Consequently, at least in the short term, it may not be necessary to 
solve the folding problem for individual protein sequences. Instead, 
information from sequence sets could be used. Perhaps by simplify- 
ing sequence space through the identification of key residues, and by 
simplifying conformation space as in the lattice methods, it will be 
possible to develop algorithms to generate a limited number of trial 
structures. These trial structures could then, in turn, be evaluated by 
further experiments and more sophisticated energy calculations. 